1 Getting ready

1.1 Installing R

Go to this link to download R: https://cran.rstudio.com/

Select the version that works for your operating system, and download the latest release (R-3.6.0 at the time of writing).

Figure 1.1: Download R.

Once you’ve downloaded R, install it following the instructions on the screen.

1.2 Installing R Studio

Go to this link to download R Studio: https://www.rstudio.com/products/rstudio/download/#download

Then download the version that works for your operating system.

Figure 1.2: Download R Studio.

Once you’ve downloaded R Studio, install it following the instructions on the screen.

2 Why R?

2.1 What you already know about R

Figure 2.1: R Survey results.

3 Setting things up

3.1 R Studio

R Studio is a great integrated development environment (IDE) in which you can do all your R coding.

Before we get started, let’s change some of the settings in R Studio first.

Figure 3.1: General preferences.

Make sure that:

  • Restore .RData into workspace at startup is unselected
  • Save workspace to .RData on exit is set to Never

This makes sure that each time we run R Studio, we start with a fresh environment rather than still having variables saved from a previous run (which can cause trouble).

Figure 3.2: Code window preferences.

Make sure that:

  • Soft-wrap R source files is selected

This way you don’t have to scroll horizontally. At the same time, avoid writing long single lines of code. For example, instead of writing code like so:

ggplot(data = diamonds, aes(x = cut, y = price)) +
  stat_summary(fun.data = "mean_cl_boot", geom = "linerange", size = 1.5) +
  stat_summary(fun.y = "mean", geom = "bar", color = "black", fill = "lightblue", width = 0.85) +
  labs(title = "Price as a function of quality of cut", subtitle = "Note: The price is in US dollars", tag = "A", x = "Quality of the cut", y = "Price")

You may want to write it this way instead:

ggplot(data = diamonds, aes(x = cut, y = price)) +
  # display the error bars
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               size = 1.5) +
  # display the means
  stat_summary(fun.y = "mean",
               geom = "bar",
               color = "black",
               fill = "lightblue",
               width = 0.85) +
  # change labels
  labs(title = "Price as a function of quality of cut",
       subtitle = "Note: The price is in US dollars", # we might want to change this later
       tag = "A",
       x = "Quality of the cut",
       y = "Price")

This makes it much easier to see what’s going on, and you can easily add comments to individual lines of code.

Here is a cheatsheet with more useful information about R Studio:

3.2 Getting help

There are a few different ways to get help in R. You can either put a ? in front of the function you’d like to learn more about, or use the help() function.

?print
help("print")

Tip: To see the help file, hover over a function (or dataset) with the mouse (or select the text) and then press F1.

I recommend using F1 to get to help files – it’s the fastest way!
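Besides ? and help(), a few other base R functions are handy when you explore a function you don't know yet. Here is a small sketch; sd() is just an arbitrary function I picked for illustration.

```r
# show the arguments of a function together with their default values
args(sd)

# store the argument names in case you want to work with them in code
tmp.arguments = names(formals(sd))
print(tmp.arguments)

# run the code from the "Examples" section of the help file
example("sd")

# to search all installed help files for a keyword, you could run
# help.search("standard deviation")
```

Running the examples of a help file is often the quickest way to see what a function actually does.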

R help files can sometimes look a little cryptic. Most R help files have the following sections (copied from here):


Title: A one-sentence overview of the function.

Description: An introduction to the high-level objectives of the function, typically about one paragraph long.

Usage: A description of the syntax of the function (in other words, how the function is called). This is where you find all the arguments that you can supply to the function, as well as any default values of these arguments.

Arguments: A description of each argument. Usually this includes a specification of the class (for example, character, numeric, list, and so on). This section is an important one to understand, because arguments are frequently a cause of errors in R.

Details: Extended details about how the function works. This section provides longer descriptions of the various ways to call the function (if applicable), and a longer discussion of the arguments.

Value: A description of the class of the value returned by the function.

See also: Links to other relevant functions. In most of the R editors, you can click these links to read the Help files for these functions.

Examples: Worked examples of real R code that you can paste into your console and run.


Here is the help file for the print() function:

Figure 3.3: Help file for the print() function.

The help files in R are often quite cryptic, and it can take some time until they become really helpful. Until then, google things! R has a very active community with a large number of posts on Stack Overflow and other online forums.

3.3 Installing and maintaining packages

What makes R powerful is the large number of packages that have been written for R. You can install a new package like so:

install.packages("tidyverse")

You can also install multiple packages at the same time, by concatenating the package names using the c() function:

install.packages(c("tidyverse","broom"))
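If you don't want to reinstall packages that are already on your machine, you can check first which ones are missing. The following is only a sketch: fun.missing_packages() is a helper I made up for this example (it is not part of R), and I've restricted the actual installation to interactive sessions.

```r
# packages that we would like to have installed
packages = c("tidyverse", "broom")

# hypothetical helper: returns the names of the packages that are not installed
fun.missing_packages = function(packages) {
  is_installed = vapply(packages,
                        FUN = requireNamespace,
                        FUN.VALUE = logical(1),
                        quietly = TRUE,
                        USE.NAMES = FALSE)
  packages[!is_installed]
}

tmp.missing = fun.missing_packages(packages)

# only install what is missing (and only when working interactively)
if (interactive() && length(tmp.missing) > 0) {
  install.packages(tmp.missing)
}
```

Note that requireNamespace() loads a package's namespace as a side effect, which is harmless here.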

To make sure that your packages remain up to date, you can go to Tools > Check for Package Updates ... in R Studio.

Figure 3.4: Checking for package updates in R Studio.

You can then click Select All and then Install Updates.

Figure 3.5: Selecting and installing package updates.

R Studio might ask you to restart your R session before updating the packages.

3.4 R Markdown

R Markdown files are a great way of organizing one's code. This tutorial is written using R Markdown! Most importantly, you can put R code straight into your R Markdown file so that you can have everything in one place. Indeed, you can write a full paper in R Markdown if you like (using the package papaja).

There are two main ways of putting code into your R Markdown document. Most often, you will create a code chunk and put the code into that chunk, like so:

a = 1 + 2 
print(a)
[1] 3

You can also evaluate R code in line with other text like so: The value of a is 3.

The nice thing about these code chunks is that they show you the output directly underneath the chunk when you run it. This is also true for plots. This means you can focus on one place rather than needing to shift back and forth between multiple windows.

And a big advantage of using R Markdown is that you can render the file in different formats by “knitting” it. For example, I’ve created the “.html” file using this R Markdown file. This is a great way of sharing your code with others and contributing to open science this way.

You can also use R Markdown to build academic homepages, and to write online books.

You can find some more information about R Markdown in the cheatsheets here:

3.5 Some general advice

Before diving into R, here are a few more general tips.

3.5.1 Naming folders and files

I suggest always using lowercase characters and avoiding whitespace in folder and file names. Use "_" or "-" instead of a space. Some programs (e.g. LaTeX) cannot deal with spaces in file paths.

3.5.2 Always use relative paths

In your R Markdown file, make sure to always use relative paths rather than absolute paths. For example, you’ll see below that I import the data using "../../data/top2018songs.csv" (relative path) rather than "/Users/tobi/Documents/work/projects_git/r_tutorial/data/data/top2018songs.csv" (absolute path).

Using relative paths has the advantage that your collaborators can run code just like you can. If you were to use an absolute path, then your collaborator wouldn’t be able to run the file without changing the path first.
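Here is a small sketch of how you could build and check a relative path in R. The path is the one from the example above; whether the file is actually found depends on your working directory.

```r
# build the relative path from its parts (".." means "one folder up")
path = file.path("..", "..", "data", "top2018songs.csv")
print(path)

# relative paths start from the working directory
getwd()

# check whether the file can be found from there
file.exists(path)
```

If file.exists() returns FALSE, the working directory is probably not where you think it is.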

3.5.3 Naming variables, functions, etc.

Personally, I like to name things consistently so that I have no trouble finding stuff even when I open up a project that I haven’t worked on for a while. I try to use the following naming conventions:

Table 3.1: Some naming conventions I adopt to make my life easier.

name      | use
df.thing  | for data frames
l.thing   | for lists
fun.thing | for functions
tmp.thing | for temporary variables
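To make the conventions concrete, here is what they could look like in code. All names and values below are made up for illustration.

```r
# a data frame
df.songs = data.frame(name = c("song a", "song b"),
                      rank = 1:2)

# a list
l.settings = list(alpha = 0.05,
                  n_samples = 100)

# a function
fun.square = function(x) {
  x^2
}

# a temporary variable
tmp.result = fun.square(3)
print(tmp.result)
```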

3.5.4 Always load all packages at the top

This way, other collaborators will directly see what packages they may need to install before running the code.

3.5.5 Make sure that a script can be executed from top to bottom

For example, you don’t want it to be the case that in order to run code chunk 2, you have to run code chunk 3 first.

3.5.6 Keep your projects organized

This GitHub repository uses a project structure that I like. I recommend keeping data, figures, and code separate. Using the same structure in different projects really helps to keep things organized, and to find things quickly.

3.5.7 Learn keyboard shortcuts!

Learning keyboard shortcuts will speed up your workflow immensely! You can view the default keyboard shortcuts here: Tools > Keyboard Shortcuts Help

You can also modify and add keyboard shortcuts via Tools > Modify Keyboard Shortcuts...

For the very eager among you, you can also take a look at snippets. Snippets allow you to define code macros for pieces of code that you use often (e.g. particular kinds of plots that you like making). You can find out more about how snippets in R Studio work here.

3.5.8 Use R projects

By using R projects you make sure that the working directory is set correctly. You can then open multiple R projects at the same time without any conflicts between the projects (otherwise, you might overwrite variables from one script with the variables of another script using the same environment). For this tutorial, I’ve created the r_tutorial.Rproj file.

3.5.9 Don’t write past the vertical rule in code blocks

This way, your code will look nice when you knit your R Markdown file into an html or a pdf output.

3.5.10 Keep your code tidy

Figure 3.6: Tidy code and data sparks joy!!!

This code block here is difficult to read:

ggplot(df.plot,aes(x = money,
                      y=happiness))+geom_point()+
geom_smooth(method="lm")

This code block is much easier to read:

ggplot(data = df.plot,
       mapping = aes(x = money,
                     y = happiness)) + 
  geom_point() +
  geom_smooth(method = "lm")

  • Use consistent indentation. R Studio makes it easy to write nice code. It figures out where to put the next line of code when you press ENTER. And if things ever get messy, just select the code of interest and hit cmd + i to re-indent the code.
  • Use named arguments for functions. For example, write ggplot(data = df.plot) instead of ggplot(df.plot). Using argument names makes it easier for others to read your code. Coming from another programming language, you might not get what seq(1, 11, 2) means, and it’ll be easier to understand seq(from = 1, to = 11, by = 2) – Ah, this is a sequence from 1 to 11 in steps of 2!
  • Use white spaces between names and arguments, and around +, =, -, etc.
  • Always have a line break after + in ggplot2 or after using the pipe %>% (which we will discuss later). This makes it easier to just run parts of your code if you want to test stuff, and to comment out parts of your code, too.
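As a quick check of the point about named arguments, the two ways of calling seq() give exactly the same result. Only the readability differs.

```r
# positional arguments: short, but you need to know the argument order
tmp.positional = seq(1, 11, 2)

# named arguments: the call explains itself
tmp.named = seq(from = 1,
                to = 11,
                by = 2)

# both calls return the same sequence
identical(tmp.positional, tmp.named)
print(tmp.named)
```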

Here are some more tips on how to write nice code in R:

3.6 R syntax

There are two main ways to code in R: one is called “base R” and the other is called “tidyverse”. The “tidyverse” is a collection of powerful packages that work very well with one another. It’s the modern way of coding in R, and this tutorial uses the tidyverse. That said, it’s still important to know how to write things using “base R”.

This cheatsheet summarizes some of the key aspects of “base R”.

3.6.1 The pipe %>%

A key part of coding in the tidyverse is using the pipe operator %>% (pronounced “then”). What’s great about the pipe operator is that it allows us to write code in an order that makes sense: first I want to do this with the data, then I want to do that, then I want to print out the result.

Let me illustrate by calculating the root mean squared error (a measure of how well your predictions fit the data). In case you’re interested, this is how RMSE is defined:

\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^n(\hat{y}_i-y_i)^2}{n}} \]

where \(\hat{y}_i\) denotes the prediction, and \(y_i\) the actually observed value.

And here is how to calculate it using standard base R syntax.

prediction = c(1, 3, 4, 5)
data = c(2, 3, 2, 3)

print(sqrt(mean((prediction - data)^2)))
[1] 1.5
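To see where the 1.5 comes from, here is the same calculation done by hand for the four values above (prediction minus data, then squared, then averaged, then the square root):

\[ \hat{y} - y = (-1, 0, 2, 2), \qquad (\hat{y} - y)^2 = (1, 0, 4, 4) \]

\[ \text{RMSE} = \sqrt{\frac{1 + 0 + 4 + 4}{4}} = \sqrt{2.25} = 1.5 \]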

Notice how we have to read what this does from the inside out: we need to start at the innermost part of the parentheses, (prediction - data)^2, and work our way out. A more intuitive way of writing the same thing is using the pipe operator like so:

prediction = c(1, 3, 4, 5)
data = c(2, 3, 2, 3)

(prediction - data)^2 %>% 
  mean() %>% 
  sqrt() %>% 
  print()
[1] 1.5

Instead of root-mean-squared error, it should really be called squared-mean-root error!

Abstractly, the pipe operator does the following:

f(x) can be rewritten as x %>% f()

It takes the output of a previous computation, and inserts this output as the first argument into the next computation. You can learn more about how the pipe works here https://r4ds.had.co.nz/pipes.html.
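The same rule holds when the function takes further arguments: x %>% f(y) is the same as f(x, y). Here is a minimal sketch using round(); the numbers are made up.

```r
library("magrittr") # provides %>% (the tidyverse packages make it available, too)

tmp.numbers = c(1.234, 5.678)

# nested version
tmp.nested = round(tmp.numbers, digits = 1)

# piped version: tmp.numbers becomes the first argument of round()
tmp.piped = tmp.numbers %>% 
  round(digits = 1)

# both versions give the same result
identical(tmp.nested, tmp.piped)
```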

Pro Tip: The keyboard shortcut for the pipe is cmd + shift + m

4 Doing stuff

4.1 Loading packages

The order in which packages in R are loaded matters!

library("tidyverse")
library("MASS")

versus

library("MASS")
library("tidyverse")

Both the MASS package and the tidyverse (via dplyr) have a function called select(). In R, whichever package is loaded later masks functions of the same name from packages that were loaded earlier.

You can refer to functions from specific packages by adding the package name at the beginning. For example, MASS::select() uses the select() function from the MASS package, while dplyr::select() uses the one from the dplyr package (irrespective of the order in which you’ve loaded the packages). However, adding the package name to a function each time it’s called is cumbersome. That’s why we want to make sure to load the packages whose functions we use most frequently last.

In particular, I suggest always loading library("tidyverse") last because it loads a large number of frequently used functions.
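You can see the same masking behavior without installing anything. In the sketch below, I mask the sd() function from the stats package with a (silly) function of my own. This is only an illustration of how R looks up names, not something you'd want to do in practice.

```r
# sd() normally comes from the stats package
tmp.before = sd(c(1, 2, 3))

# a later definition with the same name masks the one from stats
sd = function(x) {
  "this is my own sd() function"
}
tmp.masked = sd(c(1, 2, 3))

# the package prefix still reaches the original function
tmp.original = stats::sd(c(1, 2, 3))

print(tmp.masked)
print(tmp.original)

# clean up so that sd() refers to the stats version again
rm(sd)
```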

4.2 Importing data

We can import a comma-separated-value (csv) file like so (you can ignore the mutate() part for now):

df.data = read_csv(file = "../../data/top2018songs.csv") %>% 
  mutate(rank = 1:nrow(.))

Table 4.1: Description of the different columns in the data frame.

column | description
id | Spotify URI of the song
name | Name of the song
artists | Artist(s) of the song
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness | Predicts whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms | The duration of the track in milliseconds.
time_signature | An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
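In case you are curious about the mutate() part of the import: it adds a rank column that simply numbers the rows from top to bottom. Here is a sketch with a tiny made-up data frame standing in for the csv file.

```r
library("dplyr") # provides tibble(), mutate(), and %>%

# toy data standing in for the imported file (made-up values)
df.example = tibble(name = c("song a", "song b", "song c")) %>% 
  # the dot refers to the data frame coming in through the pipe
  mutate(rank = 1:nrow(.))

print(df.example)
```

This works as a ranking here because the rows of the file are already sorted from the top song downwards.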

The quickest way to take a look at your data is to hover your mouse over a variable of a data frame, and press F2.

Let’s take a look at the top of the data frame:

df.data %>% 
  print()
# A tibble: 100 x 17
   id    name  artists danceability energy   key loudness  mode speechiness
   <chr> <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
 1 6DCZ… God'… Drake          0.754  0.449     7    -9.21     1      0.109 
 2 3ee8… SAD!  XXXTEN…        0.74   0.613     8    -4.88     1      0.145 
 3 0e7i… rock… Post M…        0.587  0.535     5    -6.09     0      0.0898
 4 3swc… Psyc… Post M…        0.739  0.559     8    -8.01     1      0.117 
 5 2G7V… In M… Drake          0.835  0.626     1    -5.83     1      0.125 
 6 7dt6… Bett… Post M…        0.68   0.563    10    -5.84     1      0.0454
 7 58q2… I Li… Cardi B        0.816  0.726     5    -4.00     0      0.129 
 8 7ef4… One … Calvin…        0.791  0.862     9    -3.24     0      0.11  
 9 76cy… IDGAF Dua Li…        0.836  0.544     7    -5.98     1      0.0943
10 08bN… FRIE… Marshm…        0.626  0.88      9    -2.38     0      0.0504
# … with 90 more rows, and 8 more variables: acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, time_signature <dbl>, rank <int>

Here is a cheatsheet with more information about how to import data into R:

4.3 Data visualization

4.3.1 How not to visualize data

We should always take a look at the data first.

Figure 4.1: A not so good plot.

Figure 4.2: Another could-be-improved plot.

This second plot reminded me of the following:

Figure 4.3: Correlation is not causation.

Just because two lines look similar, doesn’t mean that anything interesting is going on – it certainly doesn’t mean that the two phenomena represented by the lines are causally connected. For more inspiration check out this site https://www.tylervigen.com/spurious-correlations.

4.3.2 Why you should always visualize your data first

Figure 4.4: The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics to two decimal places (mean, standard deviation, and Pearson’s correlation).

The data sets in Figure 4.4 all share the same summary statistics. Clearly, the data sets are not the same though.

Tip: Always plot the data first!

Here is the paper from which I took Figure 4.4. It explains how the figures were generated and shows more examples for how summary statistics and some kinds of plots are insufficient to get a good sense for what’s going on in the data.

Figure 4.5: Boxplots can be misleading.

4.3.3 Visualizing data using ggplot2

ggplot2 defines a grammar of graphics. One of the great things is that you can make a variety of different kinds of plots without ever having to change your data frame.

Here is how you would make a scatter plot:

ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) + 
  geom_point()

Adding a best-fitting linear regression line to the scatter plot is simple:

ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) + 
  geom_point() +
  geom_smooth(method = "lm")
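As mentioned at the start of this section, the same data frame can feed very different kinds of plots. Here is a sketch using mpg, an example data set that ships with ggplot2 (so you can run this without the song data). The object names are just for illustration.

```r
library("ggplot2")

# a scatter plot of engine displacement against highway mileage
p.scatter = ggplot(data = mpg,
                   mapping = aes(x = displ,
                                 y = hwy)) + 
  geom_point()

# a boxplot from the same data frame: only the mapping and the geom change
p.box = ggplot(data = mpg,
               mapping = aes(x = class,
                             y = hwy)) + 
  geom_boxplot()

# show the plots
p.scatter
p.box
```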

Here is a more involved plot that shows some of the things you can do with ggplot2:

df.plot = df.data %>% 
  mutate(mode = factor(mode,
                       levels = c(0, 1),
                       labels = c("minor", "major")),
         key = factor(key,
                      levels = 0:11,
                      labels = c("C", "C#", "D", "D#",
                                 "E", "F", "F#", "G",
                                 "G#", "A", "A#", "B")))

ggplot(data = df.plot,
       mapping = aes(x = key,
                     y = energy,
                     group = mode,
                     fill = mode)) + 
  # add individual data points 
  geom_point(mapping = aes(color = mode),
             position = position_jitterdodge(dodge.width = 0.7,
                                             jitter.width = 0.1,
                                             jitter.height = 0),
             alpha = 0.3) + 
  # add means with error bars 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange",
               position = position_dodge(width = 0.7),
               size = 0.75,
               shape = 21) +
  # add the vertical lines
  geom_vline(xintercept = seq(from = 1.5, to = 11.5, by = 1),
             linetype = 2,
             color = "gray80") + 
  # set title and subtitle of plot 
  labs(title = "Energy for songs with different key and mode",
       subtitle = "Energy represents a perceptual measure of intensity and activity.") + 
  # change the y-axis 
  scale_y_continuous(breaks = seq(0.25, 1, 0.25),
                     labels = seq(0.25, 1, 0.25),
                     limits = c(0.25, 1)) +
  # set the fill color 
  scale_fill_brewer(palette = "Set1") +
  # change the plotting theme
  theme_classic() +
  # adjust the text size
  theme(text = element_text(size = 16),
        plot.subtitle = element_text(size = 12))

# let's save the figure
ggsave(filename = "../../figures/plots/energy_key_mode.pdf",
       width = 8,
       height = 6)

Here are some cheatsheets with data visualization info:

4.3.3.1 Practice time

Make a scatter plot that shows energy on the x-axis and tempo on the y-axis.

# write your code here 

Play around with the scatter plot that you’ve just created by incorporating some of the elements I’ve used in the more complex plot above. For example, you could try the following:

  • change the size of the points
  • change the color of the points
  • change the text of the x-axis and y-axis title
  • add a regression line
  • add a horizontal line that intersects the y-axis at 100
  • add color = mode to the aes() function and figure out what this does

# write your code here

4.4 Data manipulation

Visualizing data is fun! But often, we need to spend quite a bit of time beating data into the right shape first. We want our data to be tidy – a tidy data frame has one row per observation. Once we have a tidy data frame, plotting things using ggplot2 becomes a breeze. Unfortunately, many data files aren’t tidy at all to start off with. For example, if you use Qualtrics to run your experiment, the data output will be far from tidy. So we have to learn how to beat our data into shape.
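To give you a feeling for what "one row per observation" means, here is a sketch that turns a small untidy data frame into a tidy one. The data are made up, and I'm using pivot_longer() from the tidyr package, which is just one way of reshaping data.

```r
library("tidyr")

# untidy: each row holds two observations (one per condition)
df.wide = data.frame(participant = c(1, 2),
                     condition_a = c(10, 20),
                     condition_b = c(30, 40))

# tidy: one row per observation
df.long = pivot_longer(data = df.wide,
                       cols = c(condition_a, condition_b),
                       names_to = "condition",
                       values_to = "score")

print(df.long)
```

The tidy version has four rows (two participants times two conditions), and each column now holds a single variable.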

4.4.1 Data transformation

We often want to do things to our data frame such as filter out certain observations, select a subset of the columns, rename variables, sort the rows, create new variables, and summarize the data in different ways. Here, we’ll take a quick look at these data transformations in R.

4.4.1.1 filter()

Let’s keep only the songs by the artist “Drake”.

df.data %>% 
  filter(artists == "Drake")
# A tibble: 4 x 17
  id    name  artists danceability energy   key loudness  mode speechiness
  <chr> <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
1 6DCZ… God'… Drake          0.754  0.449     7    -9.21     1      0.109 
2 2G7V… In M… Drake          0.835  0.626     1    -5.83     1      0.125 
3 3CA9… Nice… Drake          0.586  0.909     8    -6.47     1      0.0705
4 0TlL… Nons… Drake          0.912  0.412     7    -8.07     1      0.124 
# … with 8 more variables: acousticness <dbl>, instrumentalness <dbl>,
#   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
#   time_signature <dbl>, rank <int>

We can add multiple filters like so:

df.data %>% 
  filter(artists == "Drake" & danceability > 0.8)
# A tibble: 2 x 17
  id    name  artists danceability energy   key loudness  mode speechiness
  <chr> <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
1 2G7V… In M… Drake          0.835  0.626     1    -5.83     1       0.125
2 0TlL… Nons… Drake          0.912  0.412     7    -8.07     1       0.124
# … with 8 more variables: acousticness <dbl>, instrumentalness <dbl>,
#   liveness <dbl>, valence <dbl>, tempo <dbl>, duration_ms <dbl>,
#   time_signature <dbl>, rank <int>
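There are a few other ways of combining conditions in filter(). Here is a sketch with a small made-up data frame, so the values don't match the song data.

```r
library("dplyr")

# toy data (made-up values)
df.example = tibble(artists = c("Drake", "Drake", "Cardi B", "Dua Lipa"),
                    danceability = c(0.75, 0.91, 0.82, 0.84))

# conditions separated by a comma are combined with "and" (same as &)
df.and = df.example %>% 
  filter(artists == "Drake",
         danceability > 0.8)

# | means "or"
df.or = df.example %>% 
  filter(artists == "Drake" | danceability > 0.83)

# %in% checks whether a value is one of several options
df.in = df.example %>% 
  filter(artists %in% c("Drake", "Cardi B"))

print(df.and)
print(df.or)
print(df.in)
```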

4.4.1.2 select()

Let’s say we are only interested in a subset of the columns. We can use select() to do so:

df.data %>% 
  select(name, artists, rank)
# A tibble: 100 x 3
   name                         artists        rank
   <chr>                        <chr>         <int>
 1 God's Plan                   Drake             1
 2 SAD!                         XXXTENTACION      2
 3 rockstar (feat. 21 Savage)   Post Malone       3
 4 Psycho (feat. Ty Dolla $ign) Post Malone       4
 5 In My Feelings               Drake             5
 6 Better Now                   Post Malone       6
 7 I Like It                    Cardi B           7
 8 One Kiss (with Dua Lipa)     Calvin Harris     8
 9 IDGAF                        Dua Lipa          9
10 FRIENDS                      Marshmello       10
# … with 90 more rows

We can also deselect variables like so:

df.data %>% 
  select(-id)
# A tibble: 100 x 16
   name  artists danceability energy   key loudness  mode speechiness
   <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
 1 God'… Drake          0.754  0.449     7    -9.21     1      0.109 
 2 SAD!  XXXTEN…        0.74   0.613     8    -4.88     1      0.145 
 3 rock… Post M…        0.587  0.535     5    -6.09     0      0.0898
 4 Psyc… Post M…        0.739  0.559     8    -8.01     1      0.117 
 5 In M… Drake          0.835  0.626     1    -5.83     1      0.125 
 6 Bett… Post M…        0.68   0.563    10    -5.84     1      0.0454
 7 I Li… Cardi B        0.816  0.726     5    -4.00     0      0.129 
 8 One … Calvin…        0.791  0.862     9    -3.24     0      0.11  
 9 IDGAF Dua Li…        0.836  0.544     7    -5.98     1      0.0943
10 FRIE… Marshm…        0.626  0.88      9    -2.38     0      0.0504
# … with 90 more rows, and 8 more variables: acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, time_signature <dbl>, rank <int>

Now we have a data frame that has all the columns except for the id column.
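select() also comes with helper functions that save you from typing out every column name. Here is a sketch with a small made-up data frame.

```r
library("dplyr")

# toy data (made-up values)
df.example = tibble(id = c("x1", "x2"),
                    name = c("song a", "song b"),
                    duration_ms = c(200000, 180000),
                    rank = 1:2)

# move rank to the front and keep all the other columns
df.reordered = df.example %>% 
  select(rank, everything())

# select all columns whose name starts with "dur"
df.duration = df.example %>% 
  select(starts_with("dur"))

# select a range of neighboring columns
df.range = df.example %>% 
  select(name:rank)

print(df.reordered)
```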

4.4.1.3 rename()

Renaming variables is simple!

df.data %>% 
  rename(song = name,
         singer = artists)
# A tibble: 100 x 17
   id    song  singer danceability energy   key loudness  mode speechiness
   <chr> <chr> <chr>         <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
 1 6DCZ… God'… Drake         0.754  0.449     7    -9.21     1      0.109 
 2 3ee8… SAD!  XXXTE…        0.74   0.613     8    -4.88     1      0.145 
 3 0e7i… rock… Post …        0.587  0.535     5    -6.09     0      0.0898
 4 3swc… Psyc… Post …        0.739  0.559     8    -8.01     1      0.117 
 5 2G7V… In M… Drake         0.835  0.626     1    -5.83     1      0.125 
 6 7dt6… Bett… Post …        0.68   0.563    10    -5.84     1      0.0454
 7 58q2… I Li… Cardi…        0.816  0.726     5    -4.00     0      0.129 
 8 7ef4… One … Calvi…        0.791  0.862     9    -3.24     0      0.11  
 9 76cy… IDGAF Dua L…        0.836  0.544     7    -5.98     1      0.0943
10 08bN… FRIE… Marsh…        0.626  0.88      9    -2.38     0      0.0504
# … with 90 more rows, and 8 more variables: acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, time_signature <dbl>, rank <int>

4.4.1.4 arrange()

Let’s rearrange the rows of the data frame to show the most danceable song first (since all we really care about is danceability!!).

df.data %>% 
  arrange(desc(danceability))
# A tibble: 100 x 17
   id    name  artists danceability energy   key loudness  mode speechiness
   <chr> <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
 1 6vN7… Yes … Lil Ba…        0.964  0.346     5    -9.31     0      0.53  
 2 2E12… FEFE… 6ix9ine        0.931  0.387     1    -9.13     1      0.412 
 3 4qKc… Look… BlocBo…        0.922  0.581    10    -7.50     1      0.27  
 4 0JP9… Moon… XXXTEN…        0.921  0.537     9    -5.72     0      0.0804
 5 0TlL… Nons… Drake          0.912  0.412     7    -8.07     1      0.124 
 6 6n4U… Walk… Migos          0.909  0.628     2    -5.46     1      0.201 
 7 3xcC… Bella Wolfine        0.909  0.493     3    -6.69     1      0.0735
 8 7KXj… HUMB… Kendri…        0.908  0.621     1    -6.64     0      0.102 
 9 3V8U… Te B… Nio Ga…        0.903  0.675    11    -3.44     0      0.214 
10 5IaH… Tast… Tyga           0.884  0.559     0    -7.44     1      0.12  
# … with 90 more rows, and 8 more variables: acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, time_signature <dbl>, rank <int>

Note how I’ve used the desc() function here to arrange the data frame in descending order. To sort the data frame starting with the least danceable song, we would simply do:

df.data %>% 
  arrange(danceability)
# A tibble: 100 x 17
   id    name  artists danceability energy   key loudness  mode speechiness
   <chr> <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
 1 1j4k… Dusk… ZAYN           0.258  0.437    11    -6.59     0      0.039 
 2 2xGj… This… Keala …        0.284  0.704     2    -7.28     1      0.186 
 3 0u2P… love… Billie…        0.351  0.296     4   -10.1      0      0.0333
 4 1gm6… Call… The We…        0.489  0.598     1    -4.93     1      0.036 
 5 0s3n… Luci… Juice …        0.511  0.566     6    -7.23     0      0.2   
 6 7vGu… Sile… Marshm…        0.52   0.761     4    -3.09     1      0.0853
 7 5WvA… No B… DJ Kha…        0.552  0.76      0    -4.71     1      0.342 
 8 3EPX… Be A… Dean L…        0.553  0.586    11    -6.32     1      0.0362
 9 75Zv… I Fa… Post M…        0.556  0.538     8    -5.41     0      0.0382
10 0d2i… East… benny …        0.56   0.68      6    -7.65     0      0.321 
# … with 90 more rows, and 8 more variables: acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, time_signature <dbl>, rank <int>

The DJ better not play “Dusk Till Dawn - Radio Edit” the next time I go out!
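
arrange() also accepts several sorting variables at once, which helps when the first one has ties. The following is a minimal sketch on a small made-up data frame; df.toy and its values are assumptions for illustration and not part of the Spotify data.

```r
library("tidyverse")

# Hypothetical toy data (not the Spotify data set)
df.toy = tibble(
  name = c("a", "b", "c", "d"),
  mode = c(1, 0, 1, 0),
  danceability = c(0.5, 0.9, 0.7, 0.6)
)

# Sort by mode (ascending) first, and within each mode
# by danceability (descending)
df.sorted = df.toy %>% 
  arrange(mode, desc(danceability))

df.sorted
```

Variables listed later in arrange() are only used to break ties in the variables listed before them.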

4.4.1.5 mutate()

We can create new variables using mutate().

df.data %>% 
  mutate(dance_energy = danceability + energy)
# A tibble: 100 x 18
   id    name  artists danceability energy   key loudness  mode speechiness
   <chr> <chr> <chr>          <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
 1 6DCZ… God'… Drake          0.754  0.449     7    -9.21     1      0.109 
 2 3ee8… SAD!  XXXTEN…        0.74   0.613     8    -4.88     1      0.145 
 3 0e7i… rock… Post M…        0.587  0.535     5    -6.09     0      0.0898
 4 3swc… Psyc… Post M…        0.739  0.559     8    -8.01     1      0.117 
 5 2G7V… In M… Drake          0.835  0.626     1    -5.83     1      0.125 
 6 7dt6… Bett… Post M…        0.68   0.563    10    -5.84     1      0.0454
 7 58q2… I Li… Cardi B        0.816  0.726     5    -4.00     0      0.129 
 8 7ef4… One … Calvin…        0.791  0.862     9    -3.24     0      0.11  
 9 76cy… IDGAF Dua Li…        0.836  0.544     7    -5.98     1      0.0943
10 08bN… FRIE… Marshm…        0.626  0.88      9    -2.38     0      0.0504
# … with 90 more rows, and 9 more variables: acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   duration_ms <dbl>, time_signature <dbl>, rank <int>,
#   dance_energy <dbl>

Let’s take a look at the song with the highest combined danceability and energy:

df.data %>% 
  mutate(dance_energy = danceability + energy) %>% 
  select(name, artists, dance_energy) %>% 
  arrange(desc(dance_energy))
# A tibble: 100 x 3
   name                                       artists          dance_energy
   <chr>                                      <chr>                   <dbl>
 1 1, 2, 3 (feat. Jason Derulo & De La Ghett… Sofia Reyes              1.69
 2 One Kiss (with Dua Lipa)                   Calvin Harris            1.65
 3 Dura                                       Daddy Yankee             1.64
 4 Taki Taki (with Selena Gomez, Ozuna & Car… DJ Snake                 1.64
 5 Stir Fry                                   Migos                    1.63
 6 Criminal                                   Natti Natasha            1.63
 7 Échame La Culpa                            Luis Fonsi               1.62
 8 Feel It Still                              Portugal. The M…         1.60
 9 Jackie Chan                                Tiësto                   1.58
10 Te Boté - Remix                            Nio Garcia               1.58
# … with 90 more rows

Sofia Reyes wins!
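
A single mutate() call can create several variables, and later ones can build on earlier ones. Here is a small sketch of that idea; the data frame df.toy and the new column names duration_s and duration_min are made up for illustration.

```r
library("tidyverse")

# Hypothetical toy data with song durations in milliseconds
df.toy = tibble(
  name = c("short song", "long song"),
  duration_ms = c(120000, 300000)
)

# Create two new variables in one call;
# duration_min uses duration_s, which was created just before it
df.new = df.toy %>% 
  mutate(duration_s = duration_ms / 1000,
         duration_min = duration_s / 60)

df.new
```

Note that mutate() does not change df.toy itself. To keep the new variables, the result has to be assigned to a name (here df.new).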

4.4.1.6 group_by() and summarize()

Grouping and summarizing is a very powerful combination! For example, let’s say that we are interested in the average rank of each artist who had more than one song in the top 100. Here is how we could go about it.

First, I group the data frame by the artists variable, and then I summarize the information I would like for each group. Here, I calculate the mean rank, the standard deviation of the rank, and the number of hits (using the n() function) per artist. I then keep only those artists who had more than 1 hit in the top 100, and arrange the data frame starting with the artists with the most hits.

df.data %>% 
  group_by(artists) %>% 
  summarize(mean_rank = mean(rank),
            sd_rank = sd(rank),
            n_hits = n()) %>% 
  filter(n_hits > 1) %>% 
  arrange(desc(n_hits)) %>% 
  ungroup()
# A tibble: 18 x 4
   artists         mean_rank sd_rank n_hits
   <chr>               <dbl>   <dbl>  <int>
 1 Post Malone          33.2   35.4       6
 2 XXXTENTACION         41.2   33.3       6
 3 Drake                20.2   28.3       4
 4 Ed Sheeran           47     33.0       3
 5 Marshmello           42.3   28.7       3
 6 Ariana Grande        41.5   34.6       2
 7 Calvin Harris        49.5   58.7       2
 8 Camila Cabello       24     18.4       2
 9 Clean Bandit         64.5   46.0       2
10 Dua Lipa             17     11.3       2
11 Imagine Dragons      49.5    7.78      2
12 Kendrick Lamar       49.5   47.4       2
13 Khalid               57     42.4       2
14 Maroon 5             41.5   38.9       2
15 Migos                78      5.66      2
16 Ozuna                86      2.83      2
17 Selena Gomez         43      7.07      2
18 The Weeknd           61.5   16.3       2

Looks like Post Malone was killing it in 2018!
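
To see what each step of this pipeline does, it can help to run it on a data frame that is small enough to check by hand. The sketch below uses made-up artists and ranks (df.toy is an assumption, not the Spotify data).

```r
library("tidyverse")

# Hypothetical toy data: artist A has 2 songs, B has 1, C has 3
df.toy = tibble(
  artists = c("A", "A", "B", "C", "C", "C"),
  rank = c(1, 3, 2, 4, 5, 6)
)

df.summary = df.toy %>% 
  group_by(artists) %>%               # one group per artist
  summarize(mean_rank = mean(rank),   # one row per group
            n_hits = n()) %>%         # n() counts the rows in each group
  filter(n_hits > 1) %>%              # keep artists with more than one song
  arrange(desc(n_hits))               # most hits first

df.summary
```

Artist B drops out because of the filter() step, and C comes before A because it has more rows.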

Here is more information about how to transform your data:

4.4.2 Data wrangling

Beating data into the shape we’d like it to be in can be frustrating, so it’s worth learning how to do it well. That way you can get to the fun stuff (such as making cool-looking plots!) as quickly as possible.

Unfortunately, we won’t have the time to look into data wrangling in this tutorial. Here is a table of the data manipulation verbs that you want to check out and play around with:

Table 4.2: Important data wrangling verbs to check out.

verb          description
gather()      transform a data frame from wide to long format
spread()      transform a data frame from long to wide format
unite()       unite multiple columns into one
separate()    separate a single column into several columns
left_join()   combine information from multiple data frames into one
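
As a rough illustration of three of these verbs, here is a minimal sketch on made-up data. The data frames df.wide and df.info and their column names are assumptions for this example only. In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(), although the older functions still work.

```r
library("tidyverse")

# Hypothetical wide-format data: one row per participant,
# one column per condition
df.wide = tibble(
  participant = c(1, 2),
  cond_a = c(10, 20),
  cond_b = c(30, 40)
)

# Wide to long: one row per participant and condition
df.long = df.wide %>% 
  gather(key = "condition", value = "score", cond_a, cond_b)

# Long back to wide: one column per condition again
df.back = df.long %>% 
  spread(key = condition, value = score)

# Hypothetical second data frame with participant information
df.info = tibble(
  participant = c(1, 2),
  age = c(25, 31)
)

# Add the age of each participant to the long data frame
df.joined = df.long %>% 
  left_join(df.info, by = "participant")

df.joined
```

The long format (df.long) is usually the one that ggplot() and the dplyr verbs above work with most easily.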

Here is the data wrangling cheatsheet (data wrangling will take some time to get familiar with):

4.4.3 Practice

What was the longest song in the Spotify top 100 of 2018?

# write your code here

What was the mean liveness of all songs by Drake?

# write your code here

4.5 Statistics

As we’ve seen, R is great for plotting and data wrangling. It’s also great for doing statistics! Again, we won’t have the time to go into it in this class. Most of your statistical needs will be met by the following functions:

  • Linear model lm(): for when you have independent observations.
  • Linear mixed effects models lmer() (using library("lme4")): for when your data points aren’t independent (e.g. when you have repeated observations from the same participants in your experiment).
  • Bayesian models brm() (using library("brms")): if you’d like to try out some Bayesian data analysis.
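
To give a feel for the first of these, here is a minimal sketch of fitting a linear model with lm(). It uses the mtcars data set that ships with R rather than the Spotify data, and the choice of predictor (car weight, wt) and outcome (miles per gallon, mpg) is only for illustration.

```r
# Fit a linear model that predicts fuel efficiency (mpg)
# from car weight (wt), using the built-in mtcars data set
fit = lm(formula = mpg ~ wt,
         data = mtcars)

# Estimated intercept and slope
coef(fit)

# Full summary with standard errors, t-values, and R-squared
summary(fit)
```

lmer() and brm() use a similar formula interface, with extra syntax for specifying the grouping structure of the data.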

5 Help others help you

Figure 5.1: This man doesn’t look particularly helpful!

The best way to help others help you is by making a reproducible example (also called “reprex”). The “reprex” package makes it easy to generate a reproducible example that you can then share with others.

You can install the package like so:

install.packages("reprex")

Now you have a new RStudio add-in that you can use for making reproducible examples. First, select the code that you want to use for generating the example:

library("ggplot2")

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = cty)) + 
  geom_bar()

Then go to Tools > Addins > Browse Addins and select Reprex selection (see Figure):

You’ll get the following output from running reprex on this code, which you can then email to your colleague or share on Stack Overflow when posting a question.

library("ggplot2")

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = cty)) + 
  geom_bar()
#> Error: stat_count() must not be used with a y aesthetic.

Using reprex, you make sure that the other person will be able to recreate the error message that you got (because it runs the code in a clean environment, i.e. without any packages already loaded or variables that you may have stored in your environment).

For example, if you were to run reprex on this piece of code …

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = cty)) + 
  geom_bar()

… the output would be the following:

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = cty)) + 
  geom_bar()
#> Error in ggplot(data = mpg, mapping = aes(x = class, y = cty)): could not find function "ggplot"

You can learn more about the reprex package here: https://github.com/tidyverse/reprex

6 Where can I learn more?

Here is a list of excellent free online books that you should check out!

And of course, Google and Stack Overflow will be your best friends when figuring stuff out!

7 Session information

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1     purrr_0.3.2    
 [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0  
 [9] tidyverse_1.2.1 knitr_1.23     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1          lubridate_1.7.4     lattice_0.20-38    
 [4] assertthat_0.2.1    zeallot_0.1.0       digest_0.6.19      
 [7] utf8_1.1.4          R6_2.4.0            cellranger_1.1.0   
[10] backports_1.1.4     acepack_1.4.1       evaluate_0.14      
[13] httr_1.4.0          highr_0.8           pillar_1.4.1       
[16] rlang_0.3.4         lazyeval_0.2.2      readxl_1.3.1       
[19] data.table_1.12.2   rstudioapi_0.10     rpart_4.1-15       
[22] Matrix_1.2-17       checkmate_1.9.3     rmarkdown_1.13     
[25] labeling_0.3        splines_3.6.0       foreign_0.8-71     
[28] htmlwidgets_1.3     munsell_0.5.0       broom_0.5.2        
[31] compiler_3.6.0      modelr_0.1.4        xfun_0.7           
[34] pkgconfig_2.0.2     base64enc_0.1-3     htmltools_0.3.6    
[37] nnet_7.3-12         tidyselect_0.2.5    htmlTable_1.13.1   
[40] gridExtra_2.3       bookdown_0.11       Hmisc_4.2-0        
[43] fansi_0.4.0         crayon_1.3.4        withr_2.1.2        
[46] grid_3.6.0          nlme_3.1-140        jsonlite_1.6       
[49] gtable_0.3.0        magrittr_1.5        scales_1.0.0       
[52] cli_1.1.0           stringi_1.4.3       latticeExtra_0.6-28
[55] xml2_1.2.0          vctrs_0.1.0         generics_0.0.2     
[58] Formula_1.2-3       RColorBrewer_1.1-2  tools_3.6.0        
[61] glue_1.3.1          hms_0.4.2           jpeg_0.1-8         
[64] survival_2.44-1.1   yaml_2.2.0          colorspace_1.4-1   
[67] cluster_2.0.9       rvest_0.3.4         haven_2.1.0